Exploring the gut DNA virome in fecal immunochemical test stool samples reveals associations with lifestyle in a large population-based study

Stool samples for fecal immunochemical tests (FIT) are collected in large numbers worldwide as part of colorectal cancer screening programs. Employing FIT samples from 1034 CRCbiome participants, recruited from a Norwegian colorectal cancer screening study, we identify, annotate and characterize more than 18000 DNA viruses, using shotgun metagenome sequencing. Only six percent of them are assigned to a known taxonomic family, with Microviridae being the most prevalent viral family. Linking individual profiles to comprehensive lifestyle and demographic data shows 17/25 of the variables to be associated with the gut virome. Physical activity, smoking, and dietary fiber consumption exhibit strong and consistent associations with both diversity and relative abundance of individual viruses, as well as with enrichment for auxiliary metabolic genes. We demonstrate the suitability of FIT samples for virome analysis, opening an opportunity for large-scale studies of this enigmatic part of the gut microbiome. The diverse viral populations and their connections to the individual lifestyle uncovered herein pave the way for further exploration of the role of the gut virome in health and disease.

Point by point response "Exploring the gut DNA virome in fecal immunochemical test stool samples reveals novel associations with lifestyle in a large population-based study" by Istvan, Birkeland et al.
We would like to thank the reviewers for their thorough and critical review of our manuscript. In this rebuttal letter, we address each of the concerns and suggestions raised by the reviewers and explain how we have revised the manuscript to address these points.
Both reviewers pointed out the added value of comparing the stability of FIT samples with stool samples stored in buffers designed for microbiome analysis. We agree with this assessment, and we have now included new and independent sequencing data on paired stool samples from an Italian population sampled using both the FIT kit and the Norgen nucleic acid kit. In the revised version of the manuscript, we demonstrate that virome analyses based on FIT samples provide results that are very similar to those of standard sampling procedures for gut microbiome analyses at the sequencing coverage employed in the current study (results: lines 124-137, figure 2c-d, supplementary figure 2). To obtain samples and conduct metagenome sequencing of the newly analyzed samples, we have relied on our collaborators Barbara Pardini and Sonia Tarallo at the Italian Institute for Genomic Medicine in Turin, whom we now wish to include as co-authors of this paper. Their inclusion as co-authors has been approved by all other co-authors of the manuscript.
In addition to changes made in response to the reviewers, some minor changes have been made to the manuscript, and the methods section has been moved to the end of the manuscript. We have also updated the data availability statement. All text changes from the initial version are in red font.
We have also changed the title of the manuscript according to the suggestion from the reviewer: "Exploring the gut DNA virome in fecal immunochemical test stool samples reveals novel associations with lifestyle in a large population-based study". We hope that the revised manuscript and our responses provided below will address the issues and concerns raised by the reviewers.

Reviewer #1 (Remarks to the Author):
All published research articles on the fecal microbiome and especially the fecal virome have used either fresh stool samples or stool samples preserved in buffers that are designed for stabilization of the fecal microbiome. This new research paper tried to prove that the DNA virome could be analyzed in the stool collected in the FIT sampling tubes which have been widely used in national bowel cancer screening programs among different countries. Thus, once validated, the research method of this paper may enable future large-cohort clinical studies with the existing infrastructure of the bowel cancer screening program including easy transportation of FIT tube and recruitment of participants. This paper could be improved by addressing the following issues:
1. The authors suggested that FIT tubes could be a suitable sampling method for virome characterization. However, this study only investigates the DNA virome in stool collected from FIT tubes but did not compare these data with those data collected from the current methodologies, i.e., either fresh stool samples or stool samples preserved in buffers that are designed for the stabilization of the fecal microbiome. Including this comparison will strengthen this hypothesis.
Response: We thank the reviewer for pointing out this limitation of our study. To address it, we have now included metagenomic sequencing data derived from stool samples from 7 individuals. The samples were collected using both FIT sampling kits and the widely used Norgen nucleic acid collection and transport tubes (Norgen Biotek Corp).
DNA extraction controls resulted in the generation of a total of 32 QC sequencing reads, whereas sequencing of the library prep negative controls resulted in a total of 3 reads. This information has now been added to the main text (lines 481-484).
6. On page 9, line 271, Peduoviridae did not have a significant difference in "organic nitrogen" but the author claims that it is largely increased. This is not an appropriate description.
Response: We apologize for the confusion. What we meant to convey was that the "organic nitrogen" group of AMGs was almost exclusively found in genomes of the Peduoviridae and those without family annotations. While no formal statistical test was reported in support of this claim, as the statistical testing was performed for one family group versus all others, the numbers are clear in this regard: among genomes of the Peduoviridae family and those without family annotation, 13.5% (n = 2323 in total) carried an "organic nitrogen" AMG, whereas of the remaining genomes, 0.7% carried "organic nitrogen" AMGs (n = 7 in total). We have modified the text to make our intended meaning clearer (line 192).
In Figure 4c, do the genome integrations of unintegrated families have a significant decrease statistically as compared to those of the other families?
Response: Yes, these are indeed highly statistically significant. This information is now presented in Supplementary Table 5.
8. In Figure 5 legend, gray should be grey.

Response: Done.
9. This paper only analyzed the fecal DNA virome. So the title should be changed from "gut virome" to "fecal DNA virome".
Response: Indeed, we have only investigated the DNA virome and we revised the title accordingly. However, we would argue against using 'fecal DNA virome' instead of 'gut DNA virome'. Stool samples are a very widely used proxy for gut microbiome studies, and almost all our collective knowledge on the gut microbiome stems from the analysis of stool samples rather than invasive sampling from the gastrointestinal tract.
10. It would be good to include the ethics approval information, such as the ethics approval number, etc.
Response: We apologize for not including this information in the first draft of the manuscript. The CRCbiome project was approved by the Norwegian Regional Committees for Medical and Health Research Ethics (Approval no.: 63148). The MITOS study, from which the paired stool samples are derived, was approved by the local Ethics committee (AOU Città della salute e della Scienza di Torino, Italy). This information is now added to the Ethical Approval section of the revised manuscript (lines 586-589).

What does the color represent in Suppl. Fig 3?
Response: The color represents the difference in prevalence (given in percent) of a given AMG category in vOTUs associated with a given lifestyle factor and that of all vOTUs. Red shading indicates higher prevalence in the differentially abundant vOTUs, and blue shading indicates lower prevalence in the differentially abundant vOTUs. This information is now added to the figure legend (now Supplementary Figure 5).
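The quantity behind the shading can be illustrated with a minimal sketch. This is not the code used in the study: the function name, the toy vOTU identifiers, and the "energy" category are hypothetical, and the only assumption is that each vOTU is represented by the set of AMG categories it carries.

```python
def prevalence_difference(associated, background, category):
    """Percentage-point difference in the prevalence of one AMG category
    between lifestyle-associated vOTUs and all vOTUs."""
    def prevalence(votus):
        if not votus:
            raise ValueError("vOTU set must not be empty")
        hits = sum(1 for amgs in votus.values() if category in amgs)
        return 100.0 * hits / len(votus)

    return prevalence(associated) - prevalence(background)


# Hypothetical toy data: vOTU id -> set of AMG categories carried.
all_votus = {
    "vOTU1": {"organic nitrogen"},
    "vOTU2": set(),
    "vOTU3": {"energy"},
    "vOTU4": {"organic nitrogen", "energy"},
}
# Hypothetical subset of vOTUs differentially abundant for one lifestyle factor.
fiber_votus = {k: all_votus[k] for k in ("vOTU1", "vOTU4")}

diff = prevalence_difference(fiber_votus, all_votus, "organic nitrogen")
```

Under the convention described in the response, a positive value would be drawn in red (higher prevalence in the differentially abundant vOTUs) and a negative value in blue.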
12. In Suppl Fig5, does the "original effect size" mean "Effect size with CRC"?
Response: Original effect size does indeed include CRC cases. We agree that there is insufficient information for the interpretation of this figure and have therefore added a more detailed description to the figure legend (now Supplementary Figure 6).

Reviewer #2 (Remarks to the Author):
The authors describe characterization of the virome in stool samples collected for colorectal cancer screenings. The study includes a large number of samples (1034), which is its most distinct accomplishment.

Major points:
The authors extracted DNA from samples stored in the collection buffer and demonstrate that they can then sequence the DNA and identify viral sequences. However, the authors only evaluate DNA, which should be noted in the title and abstract, since there are important RNA viruses in the gut.
Response: Thank you for raising this point. We have now specified that we only study the DNA virome in both the title and the abstract.
The authors also include that extracting DNA from the collection buffer for sequencing works well, but there is no comparison with samples collected in a more optimal way. It is possible that the collection method has introduced biases in detection, and that limitation should be mentioned.
Response: We thank the reviewer for this suggestion. We have now added a pairwise comparison between stool samples stored in the FIT buffer and in the Norgen buffer designed for DNA stabilization in fecal samples; please see also the response to point 1 of Reviewer 1. Results from this comparison are presented in lines 124-137, in figure 2c-d, and in supplementary figure 2.
Some of the methods are a little unclear. It is a little unusual that libraries were size selected after pooling. The insert size is also a little larger than I might have expected for NovaSeq sequencing. What was the method of size selection on the large pool of 240 libraries? Why was that size range selected?
Response: The sequencing was performed at the FIMM Technology Centre (University of Helsinki, Finland) according to the standard library prep procedure and recommendations from the sequencing centre lab. AMPure XP (Beckman Coulter; line 478) was used for the clean-up of the pooled amplicons. We did not have 240 libraries; rather, each library comprised 240 samples, a number chosen to achieve a target of 3 Gb of sequencing data per sample.
What databases were used for alignments? Parameters? Were translated alignments included in assigning taxonomy? Were bacterial sequences screened out? Without screening, false positives can sometimes be observed. How did the authors limit false positives?
Response: Viral sequences were extracted from VirSorter2-predicted viral contigs using CheckV. Based on CheckV assessments, these were filtered to include only those of medium quality, high quality, or those that were categorized as complete. We used INPHARED for construction of a reference database for taxonomy assignment. INPHARED is a tool for the automatic download and filtering of complete and near-complete phage genomes from GenBank (NCBI). In the revised version of the manuscript, we further used a broader viral database that included viruses with hosts other than bacteria/archaea (see response below). vConTACT2 was used for inference of viral taxonomy, which is accomplished by construction of gene-sharing networks based on translated protein sequences. vConTACT2 was run with default parameters; details are given in line 534. As a control for false positives introduced at the data generation steps, we had negative controls from DNA isolation (n=6) and library preparation (n=2). The DNA extraction controls resulted in the generation of 32 QC sequencing reads, whereas sequencing of the library prep negative controls resulted in 3 reads (lines 481-484). In addition, we included two positive control samples using the ZymoBIOMICS Microbial Community Standard with known composition, intended for ensuring the accuracy and quality control of microbiome analyses. Subjecting these standards to the analytic procedure used in the study resulted in the identification of 15 prophages, all assigned to taxa that infect the bacteria included in the community control (Supplementary Table 10). The fact that all viruses were specific to the community standard suggests no or low-level false-positive viral detection. This information is now added in lines 537-540.
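The quality filter described above can be sketched as follows. This is an illustration rather than the pipeline used in the study: it assumes a tab-separated CheckV quality summary with `contig_id` and `checkv_quality` columns and the tier labels `Complete`, `High-quality` and `Medium-quality`; the function name and the example rows are hypothetical.

```python
import csv
import io

# Quality tiers retained, following the response text
# (tier labels assumed to match CheckV's quality summary output).
KEPT_TIERS = {"Complete", "High-quality", "Medium-quality"}


def filter_by_quality(tsv_handle, kept=KEPT_TIERS):
    """Return the ids of contigs whose quality tier is in `kept`."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return [row["contig_id"] for row in reader if row["checkv_quality"] in kept]


# Hypothetical four-contig summary standing in for a real file.
example = io.StringIO(
    "contig_id\tcheckv_quality\n"
    "contig_1\tComplete\n"
    "contig_2\tLow-quality\n"
    "contig_3\tMedium-quality\n"
    "contig_4\tNot-determined\n"
)
kept_contigs = filter_by_quality(example)
```

With a real summary file, the same function could be called on an opened file handle instead of the in-memory example.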
What about human viruses? It is surprising to not see more representation of human viruses in this large data set. If these are missed because of low abundance and failure to assemble, other analysis approaches may need to be employed to ensure they are being included and the data are representative of the viral communities.
Response: Thank you for the suggestion. We have now included annotation to eukaryotic viruses in the updated version of the manuscript. We are unaware of any comprehensive human gut eukaryotic viral database. We therefore employed the Virus-Host database, which covers RefSeq and GenBank deposited viruses and includes manually curated information on hosts retrieved from GenBank, RefSeq, UniProt, ViralZone and literature surveys (https://www.genome.jp/virushostdb/). Six vOTUs (0.02%; Supplementary Table 3) had the closest protein similarity to eukaryotic viruses. One of them was classified as human papillomavirus 6 (CRCbiome_vOTU05832). We therefore also mapped sequencing reads to the PaVE: Papillomavirus Episteme database (https://pave.niaid.nih.gov). Read mapping confirmed that HPV6 was the only type of HPV identified, and that this was restricted to only one individual. Two vOTUs were also assigned taxonomy based on clustering with the INPHARED bacteriophage database. One vOTU clustered with the Flyfo siphovirus Tbat1_6 using either database. This virus is a bacteriophage isolated from the feces of the Pacific Flying Fox (https://journals.asm.org/doi/10.1128/mra.00038-22), and the host was erroneously annotated in the Virus-Host database. The other vOTU was indirectly clustered both to Fadolivirus algeromassiliense (an amoeba virus targeting Vermamoeba vermiformis) when using the Virus-Host database and to the Acinetobacter phage MD-2021a when annotating against the INPHARED database. Both references had low similarity to the vOTU at the nucleotide level. Given these uncertainties in host annotations, the scarcity of reference databases, and the generally low abundance of eukaryotic DNA viruses in the gut, we decided to focus on the phage virome. These results are now provided in lines 152-166.
While not necessary to address this point, this paper would have more impact if the bacterial viruses were related to bacterial communities/bacterial hosts.
Response: We agree with the reviewer that bacteria/viral interactions are a very interesting topic to investigate.However, this question is so broad and requires such a comprehensive description that we believe it is better addressed in a separate manuscript.

Response: The text has been changed, and now reads 'employs BBTools utilities' (line 498).
Fig 5 also needs a longer description of what the sub-figure represents.
These individuals were recruited in the frame of the regular Piedmont Region CRC screening program in the Microbiome and MiRNA in Torino Screening (MITOS) project (https://doi.org/10.1053/j.gastro.2023.05.037; https://doi.org/10.1186/s12943-023-01869w). The Piedmont Region screening program invites all residents aged 59-69 to undergo a single-sample biennial FIT test. Both FIT and Norgen samples were stored at -80 °C for 3-5 years before DNA extraction. While these samples are limited in number, they show that the use of FIT samples for virome analysis is not inferior to Norgen sample collection and gives very similar results. The results are summarized in lines 124-137 and are presented in figure 2c-d and in supplementary figure 2. Sections describing these samples have also been added to the materials and methods.